Technique to enable store forwarding during long latency instruction execution

ABSTRACT

A technique to allow independent loads to be satisfied during high-latency instruction processing. Embodiments of the invention relate to a technique in which a storage structure is used to hold store operations in program order while independent load instructions are satisfied during a time in which a high-latency instruction is being processed. After the high-latency instruction is processed, the store operations can be restored in program order without searching the storage structure.

FIELD

Embodiments of the invention relate to microprocessors and microprocessor systems. More particularly, embodiments of the invention relate to a technique to enable store data to be forwarded to load instructions without having to search a queue for the store to be forwarded.

BACKGROUND

When performing load and store instructions, typical prior art microprocessors rely on a searchable queue containing content-addressable memory (CAM) logic to enforce ordering among memory operations and to forward data corresponding to store instructions to load instructions while high-latency instructions are accessing data from memory (“pending”). High-latency instructions can result from the instruction having to resort to a memory structure having a relatively slow access time, such as dynamic random access memory (DRAM), if the corresponding data is not present in a relatively faster memory structure, such as a cache memory. The lack of the desired data within a particular memory structure is commonly referred to as a “miss”, while the presence of the data within a memory structure is commonly referred to as a “hit”.

FIG. 1 illustrates a prior art processor architecture including logic for servicing instructions that are independent of a high-latency instruction. The prior art architecture of FIG. 1 can service instructions continuously without stalling the processor, including instructions that are independent of long-latency instructions, such as loads that are accessing data from a relatively slow memory source (e.g., DRAM). In particular, instructions decoded by the instruction decoder and allocated registers by the allocate and register renamer are stored as micro-operations (uops) in uop queues, from which they are scheduled for execution by the functional units and committed to the register file.

The prior art architecture of FIG. 1 allows miss-independent instructions to use register file and scheduler resources by forcing long-latency instructions, and those instructions dependent upon the long-latency instructions, to relinquish scheduling and register file resources until the miss can be serviced. This allows miss-independent instructions to execute and complete without being blocked by long-latency instructions or their dependents.

Instructions dependent on the long-latency instruction, in FIG. 1, are temporarily stored in a wait buffer, while independent instructions are serviced during the pendency of the long-latency instruction. However, in order to ensure correct memory ordering, all store instructions concurrently in process (“in flight”) must be stored during the pendency of the long-latency instruction, typically requiring large store queues (e.g., L1 and L2 store queues). These store queues can grow with increased instruction processing.

Moreover, in order to search these store queues, extra logic, such as CAM logic, may be necessary. Particularly, load operations searching for a corresponding store operation having data to satisfy the load operations typically search a relatively large store queue using CAM logic that increases in size with the size of the queue.

Searching a large store queue that has CAM logic can potentially increase cycle time or increase the number of cycles it takes to access the store queue. Further, using searchable store queues to forward store data to the proper load instruction can become increasingly difficult to accommodate as the number of in-flight instructions increases during processing of a long-latency instruction, such as a load servicing a miss. Moreover, search logic, such as CAM logic, typically associated with searchable store queues can require excess power, die real estate, and processing cycles in order to satisfy independent load operations during other pending long-latency operations.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:

FIG. 1 illustrates a prior art architecture for satisfying independent operations during pending long-latency memory access operations, such as load operations, within a microprocessor.

FIG. 2 illustrates an architecture, according to one embodiment of the invention, to allow independent operations to be satisfied without using a searchable memory structure.

FIG. 3 is a flow diagram illustrating operations that may be involved in one embodiment of the invention.

FIG. 4 illustrates an architecture, according to one embodiment, including a loose-check filter to allow independent operations to be satisfied without using a searchable memory structure.

FIG. 5 is a shared bus computer system in which one embodiment of the invention may be used.

FIG. 6 is a point-to-point computer system in which one embodiment of the invention may be used.

FIG. 7 illustrates an architecture, according to one embodiment, using a load buffer to store load addresses corresponding to matching store addresses.

FIG. 8 illustrates an entry of the load buffer of FIG. 7, according to one embodiment.

DETAILED DESCRIPTION

Embodiments of the invention relate to a technique to service instructions that are independent of high-latency instructions, such as a load instruction accessing DRAM to retrieve load data. More particularly, embodiments of the invention relate to a technique to match store data to load operations while high-latency operations are pending without using a searchable memory device.

At least one embodiment of the invention replaces a second-level (“L2”) queue having associated content addressable memory (CAM) logic with a first-in-first-out (FIFO) queue that holds store operations (both dependent and independent) in the shadow of a long-latency operation or operations being performed at a certain time. A FIFO queue has the potential to use less power, because the FIFO queue does not require a search and associated search logic, as does a typical CAM structure. Furthermore, independent loads are forwarded data from the L1 data cache to help maintain performance in at least one embodiment.

FIG. 2 illustrates one embodiment of the invention in which a FIFO queue is used to hold store operations while an operation or operations are accessing a relatively slow-access memory, such as dynamic random access memory (DRAM), due to a cache miss. More particularly, FIG. 2 illustrates a first-level (“L1”) store queue 201 in which various instructions or micro-operations are stored. If an operation, such as an instruction or micro-operation, cannot retrieve its required data from a relatively fast-access memory, such as a data cache, thereby creating a cache miss condition, it may resort to other memory, such as DRAM or an upper level, and often slower, cache memory to retrieve the data. An operation, such as an instruction or micro-operation, that accesses a relatively slow-access memory source, such as DRAM, is hereafter referred to as a “high-latency” operation, instruction, or micro-operation. During the time that the high-latency operation is attempting to retrieve the data from another source, operations that are independent of the high-latency operation (hereinafter referred to as “independent operations”) should not be gated by the high-latency operation or operations dependent upon the high-latency operation (hereinafter “dependent operations”), but should be allowed to complete, as they are not dependent on the high-latency operation.

Accordingly, FIG. 2 illustrates a store redo log (SRL) 205 to hold store operations that occur after the long-latency operation, in program order. Unlike the prior art, the SRL does not contain CAM logic or any other logic necessary to search the SRL, but instead the SRL stores independent and dependent store operations in the order in which they appear in a program, so that they may be read out in program order when needed. In one embodiment, the SRL is a FIFO queue. In other embodiments, however, the SRL may be other memory structures that are not required to be searched in order for instructions or micro-operations to be retrieved from the memory structure.
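
The FIFO behavior of the SRL described above may be sketched in software as follows. This is a minimal illustration only, not the hardware of FIG. 2; the names StoreOp, StoreRedoLog, push, and drain are hypothetical, and the data cache is modeled as a plain dictionary.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class StoreOp:
    address: int
    data: int
    dependent: bool = False  # True if the store depends on the pending miss


class StoreRedoLog:
    """FIFO of stores in program order; it offers no lookup by address."""

    def __init__(self):
        self._fifo = deque()

    def push(self, store):
        # Stores are appended in the order they appear in the program.
        self._fifo.append(store)

    def drain(self):
        # Yield stores oldest-first, emptying the log without any search.
        while self._fifo:
            yield self._fifo.popleft()


srl = StoreRedoLog()
srl.push(StoreOp(0x100, 1))
srl.push(StoreOp(0x200, 2, dependent=True))
srl.push(StoreOp(0x100, 3))

data_cache = {}
for store in srl.drain():
    # Sequential replay: a later store to the same address overwrites an earlier one.
    data_cache[store.address] = store.data
```

Because the log is only ever appended to and read from the head, the final cache contents reflect program order without any address comparison inside the log.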

In one embodiment of the invention, independent load operations may be satisfied by independent store operations from the L1 store queue 201 or from the L1 data cache 210 before the high-latency operation is satisfied, by storing the desired independent store data in the L1 store queue or in the L1 data cache 210. The data cache can act as a temporary storage location for the data to be used by the independent load operations, and may be replaced with data corresponding to the long-latency operation or dependent operations after the high-latency operation is complete, depending on program order. Furthermore, any data that was stored previously in a location (“dirty blocks”) in the data cache that is being written by an independent store operation can be stored away into another memory structure, such as an L2 cache, and returned to the data cache after the long-latency operation completes.

FIG. 2 also illustrates a wait buffer to store the dependent operations while the high-latency operation is still pending. The wait buffer may be a FIFO in some embodiments. However, in other embodiments, the wait buffer may be other types of memory structures. After the high-latency operation is satisfied, dependent and independent data (including data already once written to the data cache to satisfy load operations) are reassembled in program order within the data cache. In one embodiment of the invention, the data is reassembled in the data cache by writing the data from the SRL in program order. After the data corresponding to the high-latency operation is written to the appropriate location in the data cache, the data of store operations stored in the SRL may be sequentially read out of the SRL and stored into the data cache in the appropriate location without having to search the SRL for the appropriate data.

Because all data, corresponding to both independent and dependent operations, can be read sequentially from the SRL, the data can be retrieved faster and with less power consumed than in prior art techniques using memory structures, such as an L2 store queue, to store the independent and dependent instructions or micro-operations. Furthermore, the SRL may be smaller than prior art structures, as the SRL contains no search logic, such as CAM logic.

FIG. 3 is a flow diagram illustrating a sequence of operations that may be used in one embodiment of the invention to carry out an embodiment, such as the one illustrated in FIG. 2. At operation 301, a high-latency instruction or micro-operation is encountered, causing the instruction to resort to relatively high-latency memory, such as DRAM, to retrieve data that the instruction or micro-operation requires. While the high-latency instruction or micro-operation is accessing the data it needs (i.e., the instruction or micro-operation is “pending”), instructions or micro-operations dependent on the high-latency instruction are stored in a wait buffer, and store instructions or micro-operations dependent on and independent of the high-latency instruction or micro-operation are stored in an SRL at operation 305.

Instructions or micro-operations independent of the high-latency instruction may write the appropriate data temporarily to a data cache from which the independent instructions or micro-operations may read the data at operation 310. If the data corresponding to the independent instructions is written to a dirty block at operation 315, then the data in the dirty block is temporarily stored in another memory, such as an L2 cache, at operation 320. After the high-latency instruction has retrieved its data, the independent data and dependent data may be reassembled in program order by copying the data from instructions or micro-operations stored in the wait buffer to its appropriate location in the SRL and then writing the data of store instructions or micro-operations stored in the SRL to the appropriate location in the data cache, such that the data is stored in program order at operation 325.
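
The sequence of FIG. 3 may be sketched as two phases, as follows. This is an illustrative model under stated assumptions, not the described hardware: the function names shadow_phase and completion_phase are hypothetical, the data cache and the L2 save area are dictionaries, and a dependent store carries a function of the miss data in place of the wait-buffer copy step.

```python
from collections import deque


def shadow_phase(stores, cache):
    """Run while the miss is pending.

    stores: (address, value, dependent) tuples in program order. For a
    dependent store, value is a function of the miss data (an assumption
    standing in for data later copied from the wait buffer).
    """
    srl = deque()
    saved_in_l2 = {}   # previously valid blocks moved aside (operation 320)
    touched = set()
    for address, value, dependent in stores:
        srl.append((address, value, dependent))      # operation 305
        if dependent:
            continue
        if address not in touched:
            touched.add(address)
            if address in cache:                     # operation 315
                saved_in_l2[address] = cache[address]
        cache[address] = value                       # operation 310
    return srl, saved_in_l2


def completion_phase(srl, saved_in_l2, cache, miss_data):
    """Run after the miss returns (operation 325)."""
    for address, old in saved_in_l2.items():
        cache[address] = old                         # return saved blocks
    while srl:
        address, value, dependent = srl.popleft()    # sequential, no search
        cache[address] = value(miss_data) if dependent else value


cache = {0x10: 7}
stores = [(0x10, 1, False), (0x10, lambda m: m + 1, True), (0x20, 5, False)]
srl, saved = shadow_phase(stores, cache)
during = dict(cache)          # what independent loads can see while pending
completion_phase(srl, saved, cache, miss_data=41)
```

In this example the independent store to 0x10 is temporarily visible during the shadow of the miss, and the dependent store that follows it in program order supplies the final value once the log is replayed.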

FIG. 4 illustrates one embodiment of the invention in which a loose-check filter (LCF) 403 is used to discern whether the SRL 405 contains instructions or micro-operations having data to satisfy a particular load operation. The output of the LCF is connected to a mux 407, which can select data directly from the SRL without the data having to first be copied to the data cache 410 as in the embodiment illustrated in FIG. 3. As in the embodiment of FIG. 3, the data may be selected and written to a register file (not shown) via mux 413 from the L1 queue 401 or the data cache.

In one embodiment of the invention, the LCF is a direct-mapped non-tagged counter array indexed by a hash function of a memory address. An insertion of a store operation into the SRL increments the corresponding LCF counter and a store operation removal from the SRL decrements the LCF counter. A non-zero LCF counter value suggests a possible matching store operation in the SRL, while a zero LCF counter value guarantees the absence of a matching store operation. By allowing load operations to match data in the SRL through the LCF, load operations may stall only on an LCF match to a non-zero value.
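
The counter behavior of the LCF may be illustrated by the following sketch. The class name LooseCheckFilter, its method names, the array size of 16, and the hash (address shifted right by three bits, modulo the array size) are assumptions chosen for illustration; the description above does not specify a particular hash function.

```python
class LooseCheckFilter:
    """Direct-mapped, untagged counter array indexed by a hash of the address."""

    def __init__(self, num_counters=16):
        self.counters = [0] * num_counters

    def _index(self, address):
        # Hypothetical hash: drop the low 3 bits, then wrap to the array size.
        return (address >> 3) % len(self.counters)

    def on_srl_insert(self, address):
        self.counters[self._index(address)] += 1

    def on_srl_remove(self, address):
        self.counters[self._index(address)] -= 1

    def may_match(self, address):
        # Zero guarantees no matching store in the SRL; non-zero only
        # suggests one, because unrelated addresses can share a counter.
        return self.counters[self._index(address)] != 0


lcf = LooseCheckFilter()
lcf.on_srl_insert(0x40)
hit_same = lcf.may_match(0x40)     # the inserted address
hit_alias = lcf.may_match(0xC0)    # different address, same counter
hit_other = lcf.may_match(0x48)    # different counter
lcf.on_srl_remove(0x40)
hit_after_remove = lcf.may_match(0x40)
```

The aliasing case shows why a non-zero counter can only suggest a match: without tags, the filter cannot distinguish 0x40 from 0xC0 in this configuration.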

The LCF may cause some load operations to stall unnecessarily, but to reduce such stall cases, indexed forwarding may be used in the SRL. Because many stalled load operations are recently fetched load operations appearing after the high-latency operation, but before all the store operations in the SRL have had their data stored into the data cache, a forwarding store operation in the SRL is often the last matching store operation inserted into the SRL. Therefore, in one embodiment, the LCF is extended to also store the SRL index of the last inserted store operation associated with the corresponding LCF counter. In such an embodiment, an incoming load operation needing store data corresponding to a store operation in the SRL can quickly locate the last potentially matching store data entry in the SRL. The load operation can read this entry and perform complete address and age checking without requiring a search of the SRL via CAM logic or some other logic. Furthermore, only one address comparator may be required for the entire SRL in such an embodiment.
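
Indexed forwarding may be sketched as follows. The names IndexedLCF and service_load are hypothetical, the SRL is modeled as a list of (address, data, sequence number) tuples, and the age check assumes that sequence numbers increase in program order; none of these details is taken from the figures.

```python
class IndexedLCF:
    """LCF extended with the SRL index of the last store inserted per counter."""

    def __init__(self, num_counters=16):
        self.counters = [0] * num_counters
        self.last_index = [None] * num_counters

    def slot(self, address):
        # Hypothetical hash, as in a plain counter-array filter.
        return (address >> 3) % len(self.counters)

    def on_srl_insert(self, address, srl_index):
        s = self.slot(address)
        self.counters[s] += 1
        self.last_index[s] = srl_index

    def on_srl_remove(self, address):
        self.counters[self.slot(address)] -= 1


def service_load(address, load_seq, lcf, srl):
    s = lcf.slot(address)
    if lcf.counters[s] == 0:
        return ("no_match", None)          # guaranteed absent from the SRL
    # Read exactly one SRL entry; a single address comparison, no search.
    store_address, store_data, store_seq = srl[lcf.last_index[s]]
    if store_address == address and store_seq < load_seq:
        return ("forward", store_data)
    return ("stall", None)                 # possible match cannot be confirmed


lcf = IndexedLCF()
srl = []
for address, data, seq in [(0x40, 11, 1), (0xC0, 22, 2)]:
    srl.append((address, data, seq))
    lcf.on_srl_insert(address, len(srl) - 1)

result_forward = service_load(0xC0, 5, lcf, srl)
result_stall = service_load(0x40, 5, lcf, srl)
result_none = service_load(0x48, 5, lcf, srl)
```

Stalling when the indexed entry does not match is the conservative choice in this sketch: an older matching store may still be present in the SRL, and finding it would require the search that the structure is meant to avoid.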

In some instances, loads may be predicted to be independent of a high-latency instruction, but may ultimately be dependent upon a dependent store instruction. For example, if the high-latency instruction is a load instruction, subsequent load instructions that are not dependent upon the high-latency load instruction may be dependent upon dependent store instructions. In this case, a load instruction that is satisfied with data from an independent store instruction may be satisfied with the wrong data if there are dependent store operations that appear prior to the satisfied load instruction in program order having more current data for the satisfied load instruction than the data with which the load was satisfied.

FIG. 7 illustrates one embodiment of the invention having a load buffer to store target addresses of load instructions, which can be compared to target addresses of store instructions that are in flight and are to be stored in the SRL 705 or queue 701 to determine whether the satisfied load instruction actually has the most current store data. Specifically, load buffer 703 is used to store target load addresses corresponding to load operations, which can be compared to target store addresses of store instructions en route to the SRL or the queue, effectively “snooping” the in-flight store target address.

The target addresses stored in the load buffer are compared against those of the store instructions en route to the SRL or queue, and a matching entry within the load buffer that is subsequent to a store instruction in program order sharing the same target address indicates that the load instruction is not to be satisfied with the data it retrieved from the data cache 710. In one embodiment, the load buffer is a set-associative buffer, whereas in other embodiments other storage structures may be used.

If a store is encountered within the SRL that does not correspond to the store data retrieved by the load operation from the data cache, and that is subsequent in program order to the instruction corresponding to the data retrieved from the data cache but prior in program order to the load operation, a misprediction recovery scheme can be used to retrieve the most current store data. In one embodiment, the load instruction can be satisfied with the most current store data in the SRL to which it corresponds by flushing the processing pipeline of instructions back to a certain point in execution order indicated by a checkpoint stored in an entry within the load buffer.

FIG. 8 illustrates an entry of the load buffer, according to one embodiment, in which each load buffer entry contains a physical address tag 801 that is compared against the corresponding address fields of the store operations in the SRL. Each entry also contains a valid bit 805 to indicate whether the entry contains a valid load address, an identification field 810 to indicate where execution is to resume after a load operation is incorrectly satisfied with store data from the data cache (“checkpoint”), a store buffer identification field 815 to indicate the address of a prior store instruction closest to the load instruction in program order, and a store buffer identification field 817 indicating a store from which the load was previously satisfied.
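
The load buffer entry of FIG. 8 and the comparison against in-flight stores may be sketched as follows. This is one possible reading, offered under assumptions: fields 815 and 817 are modeled as store identifiers that increase in program order, the full address serves as the tag, the set index is a hypothetical hash, and capacity and replacement of ways are omitted. The names LoadBufferEntry, LoadBuffer, record_load, and snoop_store are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class LoadBufferEntry:
    address_tag: int          # physical address tag 801
    valid: bool               # valid bit 805
    checkpoint_id: int        # identification field 810 (where to resume)
    prior_store_id: int       # field 815: closest prior store in program order
    forwarding_store_id: int  # field 817: store that satisfied the load


class LoadBuffer:
    """Set-associative in spirit: only one set is examined per snoop."""

    def __init__(self, num_sets=8):
        self.sets = [[] for _ in range(num_sets)]

    def _set_for(self, address):
        # Hypothetical set index: drop the low 3 bits, wrap to the set count.
        return self.sets[(address >> 3) % len(self.sets)]

    def record_load(self, entry):
        self._set_for(entry.address_tag).append(entry)

    def snoop_store(self, store_address, store_id):
        """Return checkpoints of loads that this store shows to be stale."""
        checkpoints = []
        for entry in self._set_for(store_address):
            same_address = entry.valid and entry.address_tag == store_address
            # Stale if the store is newer than the one that fed the load
            # but still precedes the load in program order.
            in_window = entry.forwarding_store_id < store_id <= entry.prior_store_id
            if same_address and in_window:
                checkpoints.append(entry.checkpoint_id)
        return checkpoints


lb = LoadBuffer()
lb.record_load(LoadBufferEntry(0x80, True, checkpoint_id=9,
                               prior_store_id=5, forwarding_store_id=2))
stale = lb.snoop_store(0x80, store_id=4)
same_store = lb.snoop_store(0x80, store_id=2)
younger_store = lb.snoop_store(0x80, store_id=6)
other_address = lb.snoop_store(0x88, store_id=4)
```

A non-empty result corresponds to the misprediction case, in which execution would resume from the returned checkpoint.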

In one embodiment of the invention, the load buffer entries are checked against the store operations stored in the SRL before each load operation to be satisfied reads data from the data cache. In other embodiments, this check is done after a load operation retrieves data from the data cache. Because the load buffer is set associative, instead of fully associative, for example, at least one embodiment is able to compare load addresses to corresponding store addresses in the SRL relatively quickly, such that the comparison can be made during a memory access cycle of a high-latency load operation.

FIG. 5 illustrates a front-side-bus (FSB) computer system in which one embodiment of the invention may be used. A processor 505 accesses data from a level one (L1) cache memory 510 and main memory 515. In other embodiments of the invention, the cache memory may be a level two (L2) cache or other memory within a computer system memory hierarchy. Furthermore, in some embodiments, the computer system of FIG. 5 may contain both an L1 cache and an L2 cache, which comprise an inclusive cache hierarchy in which coherency data is shared between the L1 and L2 caches.

Illustrated within the processor of FIG. 5 is one embodiment of the invention 506. In some embodiments, the processor of FIG. 5 may be a multi-core processor.

The main memory may be implemented in various memory sources, such as dynamic random-access memory (DRAM), a hard disk drive (HDD) 520, or a memory source located remotely from the computer system via network interface 530 containing various storage devices and technologies. The cache memory may be located either within the processor or in close proximity to the processor, such as on the processor's local bus 507. Furthermore, the cache memory may contain relatively fast memory cells, such as a six-transistor (6T) cell, or other memory cell of approximately equal or faster access speed.

The computer system of FIG. 5 may be a point-to-point (PtP) network of bus agents, such as microprocessors, that communicate via bus signals dedicated to each agent on the PtP network. Within, or at least associated with, each bus agent is at least one embodiment of the invention 506, such that store operations can be facilitated in an expeditious manner between the bus agents.

FIG. 6 illustrates a computer system that is arranged in a point-to-point (PtP) configuration. In particular, FIG. 6 shows a system where processors, memory, and input/output devices 614 are interconnected by a number of point-to-point interfaces.

The system of FIG. 6 may also include several processors, of which only two, processors 670, 680, are shown for clarity. Processors 670, 680 may each include a local memory controller hub (MCH) 672, 682 to connect with memory 62, 64. Processors 670, 680 may exchange data via a point-to-point (PtP) interface 650 using PtP interface circuits 678, 688. Processors 670, 680 may each exchange data with a chipset 690 via individual PtP interfaces 652, 654 using point-to-point interface circuits 676, 694, 686, 698. Chipset 690 may also exchange data with a high-performance graphics circuit 638 via a high-performance graphics interface 692 via coupling 639.

At least one embodiment of the invention may be located within the processors 670 and 680. Other embodiments of the invention, however, may exist in other circuits, logic units, or devices within the system of FIG. 6. Furthermore, other embodiments of the invention may be distributed throughout several circuits, logic units, or devices illustrated in FIG. 6. Furthermore, as shown in FIG. 6, the system may include a bus 616, a bus bridge 618, a bus 620, a keyboard/mouse 622, an audio I/O 624, communication device 626, data storage 628, code 630 (shown on the data storage 628), and processor cores 674, 684.

Embodiments of the invention described herein may be implemented with circuits using complementary metal-oxide-semiconductor devices, or “hardware”, or using a set of instructions stored in a medium that when executed by a machine, such as a processor, perform operations associated with embodiments of the invention, or “software”. Alternatively, embodiments of the invention may be implemented using a combination of hardware and software.

While the invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications of the illustrative embodiments, as well as other embodiments, which are apparent to persons skilled in the art to which the invention pertains are deemed to lie within the spirit and scope of the invention.

CLAIMS

1. An apparatus comprising: a first storage device to store a plurality of store operations in program order during a time in which a high-latency operation is accessing data from a second storage device, wherein the high-latency operation is to be issued in response to a cache miss and wherein the plurality of store operations may comprise a dependent store operation, dependent on the high-latency operation, and an independent store operation, independent of the high-latency operation; and a third storage device to store an independent load operation, wherein the independent load operation is to retrieve data corresponding to an independent store operation after the high-latency operation has been issued and before the high-latency operation has accessed data from the second storage device and without searching the first storage device.

2. The apparatus of claim 1 further comprising a fourth storage device to store data associated with the plurality of store operations after the high-latency operation has accessed the data from the second storage device.

3. The apparatus of claim 1 wherein the independent load operation is to access the data corresponding to the operation stored in a fourth storage device.

4. The apparatus of claim 3 wherein the first storage device comprises a first-in-first out queue and the fourth storage device comprises a level-1 (L1) store queue.

5. The apparatus of claim 4 wherein the second storage device comprises dynamic random access memory (DRAM).

6. The apparatus of claim 5 further comprising loose count filter to allow independent load operations to be satisfied without reading data from the third device.

7. The apparatus of claim 1, wherein data associated with the dependent store operations and data associated with independent store operations are to be reassembled based on the program order.

8. The apparatus of claim 1, wherein data for the load instruction is to be retrieved from corresponding store data in the first storage device in response to a pipeline flush to a point in execution order, wherein the point is to be indicated by a checkpoint stored in an entry within a load buffer.

9. A method comprising: issuing a high-latency instruction in response to a cache miss; storing a plurality of store instructions that are subsequent to the high-latency instruction in a first-in-first-out queue in program order, wherein the plurality of store operations may comprise a dependent store operation, dependent on the high-latency operation, and an independent store operation, independent of the high-latency operation; satisfying a load instruction that is independent of the high-latency instruction with data of a store instruction that is independent of the high-latency instruction while the high-latency instruction is retrieving data and without searching the first-in-first-out queue.

10. The method of claim 9 further comprising storing data temporarily in a data cache or a level-1 (L1) store queue while the high-latency instruction is retrieving data so that the load instruction can read the temporarily stored data.

11. The method of claim 10 wherein if the data is stored to an area of the data cache corresponding to previously stored valid data, the previously stored valid data is temporarily stored in another memory location and later returned to the area of the data cache.

12. The method of claim 10 wherein the data used to satisfy the load instruction is data corresponding to an instruction stored in the L1 store queue.

13. The method of claim 12 wherein the instruction stored in the first-in-first-out queue, having data used to satisfy the load instruction, is independent of the high-latency instruction.

14. The method of claim 9 wherein the first-in-first-out buffer stores independent and dependent store instructions in program order such that the buffer contents can be stored in a data cache in program order.

15. The method of claim 14 wherein the high-latency instruction is retrieving data from dynamic random access memory.

16. The method of claim 9, further comprising reassembling data associated with the dependent store operations and data associated with independent store operations based on the program order.

17. A system comprising: a memory to store a first data; a processor to perform a high-latency operation to retrieve the first data in response to a cache miss; a queue to store a plurality of store instructions in program order while the high-latency operation is retrieving the first data wherein the plurality of store instructions may comprise a dependent store operation, dependent on the high-latency operation, and an independent store operation, independent of the high-latency operation; a cache to temporarily comprise store data corresponding to at least one of the plurality of store instructions, the store data to satisfy a load instruction that is independent of the high-latency operation and without searching the queue; a set-associative load buffer to store addresses corresponding to the load instructions, which are to be compared with target addresses of the store instructions.

18. The system of claim 13 wherein the cache is to store the plurality of data corresponding to the store instructions in program order after the high-latency instruction is complete and after the plurality of store instructions have been read from the queue.

19. The system of claim 13 further comprising a loose count filter to allow the load operation to be satisfied from data stored in the queue instead of reading the data after it has been stored in the data cache.

20. The system of claim 13 wherein the memory to store a first data comprises dynamic random access memory.

21. The system of claim 20 wherein the queue is a first-in-first-out queue.

22. The system of claim 21 wherein the loose count filter comprises a directly mapped non-tagged counter array indexed by a hash function of a memory address.

23. The system of claim 22 wherein if a store instruction is stored in the queue, a corresponding counter in the loose count filter is incremented.

24. The system of claim 17, wherein data associated with the dependent store instructions and data associated with independent store instructions are to be reassembled based on the program order.

25. A machine-readable medium having stored thereon a set of instructions, which if executed by a machine, cause the machine to perform a method comprising: storing a plurality of store operations in program order in a first-in-first-out (FIFO) queue in response to a memory access instruction requiring more than a minimum time to be performed, wherein the memory access instruction is to be issued in response to a cache miss; retrieving the plurality of store operations from the FIFO queue and storing their data in a data cache in program order without having to search the FIFO queue in response to the memory access instruction completing wherein the plurality of store operations may comprise a dependent store operation, dependent on the high-latency operation, and an independent store operation, independent of the high-latency operation.

26. The machine-readable medium of claim 25 wherein load operations independent of the memory access instruction are to access store data corresponding to store instructions stored in a level-1 (L1) store queue while the memory access instruction is being performed.

27. The machine-readable medium of claim 26 wherein the load operations are to access the store data from a data cache.

28. The machine-readable medium of claim 26 wherein the load operations are to access the store data from the FIFO queue if loose count filter indicates that the data is present in the FIFO queue.

29. The machine-readable medium of claim 26 wherein the method further comprises comparing an address associated with at least one of the load operations with an address of the store operations to determine which of the store operations corresponds to a most current data corresponding to the at least one load operation.

30. The machine-readable medium of claim 29 wherein the method further comprises executing operations starting from a checkpoint corresponding to an operation appearing before the at least one load operation in program order.

31. The machine-readable medium of claim 30 wherein addresses associated with the load operations are to be stored in a set-associative buffer.

32. The medium of claim 25, wherein data associated with the dependent store operations and data associated with independent store operations are to be reassembled based on the program order.